Results 1 - 6 of 6
1.
EBioMedicine ; 102: 105075, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38565004

ABSTRACT

BACKGROUND: AI models have shown promise in performing many medical imaging tasks. However, our ability to explain what signals these models have learned is severely lacking. Explanations are needed in order to increase the trust of doctors in AI-based models, especially in domains where AI prediction capabilities surpass those of humans. Moreover, such explanations could enable novel scientific discovery by uncovering signals in the data that aren't yet known to experts. METHODS: In this paper, we present a workflow for generating hypotheses to understand which visual signals in images are correlated with a classification model's predictions for a given task. This approach leverages an automatic visual explanation algorithm followed by interdisciplinary expert review. We propose the following 4 steps: (i) Train a classifier to perform a given task to assess whether the imagery indeed contains signals relevant to the task; (ii) Train a StyleGAN-based image generator with an architecture that enables guidance by the classifier ("StylEx"); (iii) Automatically detect, extract, and visualize the top visual attributes that the classifier is sensitive towards. For visualization, we independently modify each of these attributes to generate counterfactual visualizations for a set of images (i.e., what the image would look like with the attribute increased or decreased); (iv) Formulate hypotheses for the underlying mechanisms, to stimulate future research. Specifically, present the discovered attributes and corresponding counterfactual visualizations to an interdisciplinary panel of experts so that hypotheses can account for social and structural determinants of health (e.g., whether the attributes correspond to known patho-physiological or socio-cultural phenomena, or could be novel discoveries). 
FINDINGS: To demonstrate the broad applicability of our approach, we present results on eight prediction tasks across three medical imaging modalities: retinal fundus photographs, external eye photographs, and chest radiographs. We showcase examples where many of the automatically learned attributes clearly capture clinically known features (e.g., types of cataract, enlarged heart), and demonstrate automatically learned confounders that arise from factors beyond physiological mechanisms (e.g., chest X-ray underexposure is correlated with the classifier predicting abnormality, and eye makeup is correlated with the classifier predicting low hemoglobin levels). We further show that our method reveals a number of physiologically plausible attributes that, based on the literature, were previously unknown (e.g., differences in the fundus associated with self-reported sex). INTERPRETATION: Our approach enables hypothesis generation via attribute visualizations and has the potential to enable researchers to better understand AI-based models, improve their assessment, and extract new knowledge from them, as well as debug and design better datasets. Though not designed to infer causality, importantly, we highlight that attributes generated by our framework can capture phenomena beyond physiology or pathophysiology, reflecting the real-world nature of healthcare delivery and socio-cultural factors, and hence interdisciplinary perspectives are critical in these investigations. Finally, we will release code to help researchers train their own StylEx models and analyze their predictive tasks of interest, and use the methodology presented in this paper for responsible interpretation of the revealed attributes. FUNDING: Google.
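Step (iii) of the workflow above (finding the attributes a classifier is most sensitive to by changing one attribute at a time) can be illustrated with a toy sketch. This is not StylEx and involves no image generator: the classifier is a hypothetical logistic function over a four-value attribute vector, and the names `WEIGHTS`, `classifier`, `attribute_effect`, and `rank_attributes` are assumptions made purely for illustration.

```python
import math
import random

# Hypothetical classifier weights: attribute 2 matters most, attribute 1 not at all.
WEIGHTS = [0.5, 0.0, 2.0, -1.0]

def classifier(attrs):
    """Toy logistic classifier over an attribute vector (an assumption, not StylEx)."""
    z = sum(w * a for w, a in zip(WEIGHTS, attrs))
    return 1.0 / (1.0 + math.exp(-z))

def attribute_effect(samples, index, delta=1.0):
    """Mean absolute change in classifier output when one attribute is shifted."""
    total = 0.0
    for attrs in samples:
        shifted = list(attrs)
        shifted[index] += delta  # counterfactual: same sample, one attribute changed
        total += abs(classifier(shifted) - classifier(attrs))
    return total / len(samples)

def rank_attributes(samples, n_attrs):
    """Attribute indices ordered from largest to smallest effect on the classifier."""
    effects = [(attribute_effect(samples, i), i) for i in range(n_attrs)]
    return [i for _, i in sorted(effects, reverse=True)]

random.seed(0)
samples = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
ranking = rank_attributes(samples, 4)
print("attributes ranked by classifier sensitivity:", ranking)
```

In the workflow described in the abstract, the shifted samples would be generated images shown to an expert panel; here they are only numbers, so the sketch conveys the ranking idea and nothing about image synthesis.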


Subject(s)
Algorithms , Cataract , Humans , Cardiomegaly , Fundus Oculi , Artificial Intelligence
3.
Nature ; 620(7972): 172-180, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37438534

ABSTRACT

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.


Subject(s)
Benchmarking , Computer Simulation , Knowledge , Medicine , Natural Language Processing , Bias , Clinical Competence , Comprehension , Datasets as Topic , Licensure , Medicine/methods , Medicine/standards , Patient Safety , Physicians
4.
Lancet Digit Health ; 5(5): e257-e264, 2023 05.
Article in English | MEDLINE | ID: mdl-36966118

ABSTRACT

BACKGROUND: Photographs of the external eye were recently shown to reveal signs of diabetic retinal disease and elevated glycated haemoglobin. This study aimed to test the hypothesis that external eye photographs contain information about additional systemic medical conditions. METHODS: We developed a deep learning system (DLS) that takes external eye photographs as input and predicts systemic parameters, such as those related to the liver (albumin, aspartate aminotransferase [AST]); kidney (estimated glomerular filtration rate [eGFR], urine albumin-to-creatinine ratio [ACR]); bone or mineral (calcium); thyroid (thyroid stimulating hormone); and blood (haemoglobin, white blood cells [WBC], platelets). This DLS was trained using 123 130 images from 38 398 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA, USA. Evaluation focused on nine prespecified systemic parameters and leveraged three validation sets (A, B, C) spanning 25 510 patients with and without diabetes undergoing eye screening in three independent sites in Los Angeles county, CA, and the greater Atlanta area, GA, USA. We compared performance against baseline models incorporating available clinicodemographic variables (eg, age, sex, race and ethnicity, years with diabetes). FINDINGS: Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST >36·0 U/L, calcium <8·6 mg/dL, eGFR <60·0 mL/min/1·73 m², haemoglobin <11·0 g/dL, platelets <150·0 × 10³/µL, ACR ≥300 mg/g, and WBC <4·0 × 10³/µL on validation set A (a population resembling the development datasets), with the area under the receiver operating characteristic curve (AUC) of the DLS exceeding that of the baseline by 5·3-19·9% (absolute differences in AUC).
On validation sets B and C, with substantial patient population differences compared with the development datasets, the DLS outperformed the baseline for ACR ≥300·0 mg/g and haemoglobin <11·0 g/dL by 7·3-13·2%. INTERPRETATION: We found further evidence that external eye photographs contain biomarkers spanning multiple organ systems. Such biomarkers could enable accessible and non-invasive screening of disease. Further work is needed to understand the translational implications. FUNDING: Google.
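The comparison metric used above, an absolute difference in AUC between the DLS and a baseline, can be sketched as follows. This is a minimal illustration, not the study's evaluation code: AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, and the labels and scores are made-up values, not study data.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count one half; this is equivalent to the Mann-Whitney U statistic
    divided by (number of positives * number of negatives).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up illustrative data (1 = abnormal lab value, 0 = normal); not from the study.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
baseline_scores = [0.6, 0.5, 0.3, 0.7, 0.4, 0.3, 0.2, 0.1]

# "Absolute difference" means the two AUCs are subtracted, not divided.
delta_points = 100 * (auc(model_scores, labels) - auc(baseline_scores, labels))
print(f"absolute AUC difference: {delta_points:.1f} percentage points")
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch; a rank-based implementation would be the usual choice for datasets of the size reported here.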


Subject(s)
Deep Learning , Diabetic Retinopathy , Humans , Retrospective Studies , Calcium , Diabetic Retinopathy/diagnosis , Biomarkers , Albumins
5.
J Diabetes Res ; 2020: 8839376, 2020.
Article in English | MEDLINE | ID: mdl-33381600

ABSTRACT

OBJECTIVE: To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as case spectrum shifts based on treatment referral and new-onset DR. METHODS: We randomly selected patients with diabetes screened twice, two years apart, within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient's color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality. RESULTS: There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and excluding ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the prevalence decrease, the sensitivity for both modalities decreased from the first to the second screening (DL: from 95% to 90%, p = 0.008; HG: from 74% to 57%, p < 0.001). At both the first and second screenings, the rate of false negatives for DL was a fifth that of HG (0.5-0.6% vs. 2.9-3.2%). CONCLUSION: On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up screenings in longitudinal DR screening can be more difficult and induce lower sensitivity for both DL and HG, though the false negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings.
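The false-negative figures above can be related to the reported prevalence and sensitivity with a short calculation. This sketch assumes the false-negative rate is expressed as missed STDR cases per screened patient, that is, prevalence × (1 − sensitivity); the abstract does not state the formula, the inputs are rounded, and ungradable cases are ignored, so this is an approximation rather than the authors' computation. The function name is hypothetical.

```python
def false_negative_rate_per_screened(prevalence, sensitivity):
    """Missed cases as a fraction of all screened patients (assumed definition)."""
    return prevalence * (1.0 - sensitivity)

# Figures reported in the abstract: (STDR prevalence, sensitivity).
screenings = {
    ("DL", "first"): (0.123, 0.95),
    ("HG", "first"): (0.123, 0.74),
    ("DL", "second"): (0.051, 0.90),
    ("HG", "second"): (0.068, 0.57),
}

rates = {}
for (modality, visit), (prev, sens) in screenings.items():
    rates[(modality, visit)] = false_negative_rate_per_screened(prev, sens)
    print(f"{modality} {visit} screening: "
          f"{100 * rates[(modality, visit)]:.1f}% of screened patients missed")
```

Under that assumption the rounded results fall within the reported ranges of 0.5-0.6% for DL and 2.9-3.2% for HG, which also shows why the ratio between the two modalities stays near one fifth even as prevalence drops at the second screening.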


Subject(s)
Deep Learning , Diabetic Retinopathy/diagnostic imaging , Fundus Oculi , Image Interpretation, Computer-Assisted , Macular Edema/diagnostic imaging , Mass Screening , Photography , Aged , Cell Proliferation , Diabetic Retinopathy/epidemiology , Female , Humans , Incidence , Longitudinal Studies , Macular Edema/epidemiology , Male , Middle Aged , National Health Programs , Predictive Value of Tests , Prevalence , Reproducibility of Results , Severity of Illness Index , Thailand/epidemiology
6.
Ophthalmology ; 126(12): 1627-1639, 2019 12.
Article in English | MEDLINE | ID: mdl-31561879

ABSTRACT

PURPOSE: To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers. DESIGN: Development and validation of an algorithm. PARTICIPANTS: Fundus images from screening programs, studies, and a glaucoma clinic. METHODS: A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic. MAIN OUTCOME MEASURES: The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features. RESULTS: The algorithm's AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929-0.960) in dataset A, 0.855 (95% CI, 0.841-0.870) in dataset B, and 0.881 (95% CI, 0.838-0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. 
For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels. CONCLUSIONS: A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup.
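Sensitivity and specificity, the outcome measures used above to compare the algorithm with graders, depend on the score threshold chosen for a "referable" call. The sketch below shows that trade-off on made-up scores and labels; it is not the study's evaluation code, and the function name and the thresholds are assumptions for illustration.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called referable."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_referable = score >= threshold
        if label == 1:
            if predicted_referable:
                tp += 1  # referable case correctly flagged
            else:
                fn += 1  # referable case missed
        else:
            if predicted_referable:
                fp += 1  # non-referable case flagged unnecessarily
            else:
                tn += 1  # non-referable case correctly passed
    return tp / (tp + fn), tn / (tn + fp)

# Made-up illustrative data (1 = referable, 0 = not referable); not from the study.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.3, 0.2, 0.1, 0.1]

for threshold in (0.5, 0.25):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Lowering the threshold raises sensitivity at the cost of specificity, which is why a comparison with human graders needs both numbers reported at the same operating point.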


Subject(s)
Deep Learning , Glaucoma, Open-Angle/diagnosis , Ophthalmologists , Optic Disk/pathology , Optic Nerve Diseases/diagnosis , Specialization , Aged , Area Under Curve , Datasets as Topic , Female , Humans , Male , Middle Aged , Nerve Fibers/pathology , ROC Curve , Referral and Consultation , Retinal Ganglion Cells/pathology , Retrospective Studies , Sensitivity and Specificity